Motivation

Wine is probably the most noble, romantic and healthy alcoholic drink. The taste of wine is very sophisticated: a distinct fragrance emerges from a mix of sour, bitter, and astringent notes. Every sip can taste a bit different. It is always difficult to describe the taste of wine, let alone to pin down what makes a good one. I was therefore very excited to learn that there is a data set documenting both the chemical composition and the sensory quality of wines. It is a great chance to unravel some of the mystery behind the taste of wine.

Summary of the data set:

The data set was collected from Vinho Verde, a wine from the Minho region in northwest Portugal. It provides data for red as well as white wines, which allows me to explore the mysterious chemistry of taste in both. Moreover, a direct comparison of the two analyses should help us understand how they differ.

These are the chemical components and properties analyzed in the data set:

Note that the units of measure in the original file are mass per volume (g/\(\mathrm{dm}^3\) or mg/\(\mathrm{dm}^3\)). As 1 mg/\(\mathrm{dm}^3\) of a dilute, water-based solution is approximately 1 ppm (parts per million), a common way to describe small concentrations, I will convert all the units to ppm in this report.

  1. fixed.acidity (tartaric acid, ppm)
  2. volatile.acidity (acetic acid, ppm)
  3. citric.acid (ppm)
  4. residual.sugar (ppm)
  5. chlorides (sodium chloride, ppm)
  6. free.sulfur.dioxide (ppm)
  7. total.sulfur.dioxide (ppm)
  8. density (g/\(\mathrm{cm}^3\))
  9. pH
  10. sulphates (potassium sulphate, ppm)
  11. alcohol (% by volume)
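
The unit conversion mentioned above is simple arithmetic. Here is a minimal Python sketch of it (the function name `to_ppm` is hypothetical, and the example values are only illustrative), assuming that 1 mg/\(\mathrm{dm}^3\) is approximately 1 ppm, which holds for dilute solutions with a density close to 1 g/\(\mathrm{cm}^3\):

```python
def to_ppm(value, unit):
    """Convert a mass concentration to ppm, assuming 1 mg/dm^3 is about 1 ppm."""
    # Assumption: the solution is dilute and water-based (density near 1 g/cm^3).
    factors = {"g/dm^3": 1000.0, "mg/dm^3": 1.0}
    return value * factors[unit]


# Illustrative values only: a fixed acidity given in g/dm^3
# and a free sulfur dioxide value given in mg/dm^3.
print(to_ppm(7.0, "g/dm^3"))
print(to_ppm(29.0, "mg/dm^3"))
```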

Quality is rated on a scale from 0 to 10, where 0 is the worst and 10 is the best.

Overview of the chemicals

The main purpose is to find out how different chemicals affect wine quality. A very simple approach is to look at the correlation coefficients between quality and each chemical. A positive coefficient means that quality tends to be higher when more of the chemical is present, and a larger absolute value indicates a stronger association (which is not yet proof of a causal effect). Let’s start with the table for red wines.

Correlation coefficients, red wine
variable fixed.acidity volatile.acidity citric.acid residual.sugar chlorides free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol quality
fixed.acidity 1.00 -0.26 0.67 0.11 0.09 -0.15 -0.11 0.67 -0.68 0.18 -0.06 0.12
volatile.acidity -0.26 1.00 -0.55 0.00 0.06 -0.01 0.08 0.02 0.23 -0.26 -0.20 -0.39
citric.acid 0.67 -0.55 1.00 0.14 0.20 -0.06 0.04 0.36 -0.54 0.31 0.11 0.23
residual.sugar 0.11 0.00 0.14 1.00 0.06 0.19 0.20 0.36 -0.09 0.01 0.04 0.01
chlorides 0.09 0.06 0.20 0.06 1.00 0.01 0.05 0.20 -0.27 0.37 -0.22 -0.13
free.sulfur.dioxide -0.15 -0.01 -0.06 0.19 0.01 1.00 0.67 -0.02 0.07 0.05 -0.07 -0.05
total.sulfur.dioxide -0.11 0.08 0.04 0.20 0.05 0.67 1.00 0.07 -0.07 0.04 -0.21 -0.19
density 0.67 0.02 0.36 0.36 0.20 -0.02 0.07 1.00 -0.34 0.15 -0.50 -0.17
pH -0.68 0.23 -0.54 -0.09 -0.27 0.07 -0.07 -0.34 1.00 -0.20 0.21 -0.06
sulphates 0.18 -0.26 0.31 0.01 0.37 0.05 0.04 0.15 -0.20 1.00 0.09 0.25
alcohol -0.06 -0.20 0.11 0.04 -0.22 -0.07 -0.21 -0.50 0.21 0.09 1.00 0.48
quality 0.12 -0.39 0.23 0.01 -0.13 -0.05 -0.19 -0.17 -0.06 0.25 0.48 1.00
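
A table like the one above holds the Pearson correlation coefficient for every pair of columns. The following is a minimal Python sketch of that computation, not the code used for this report; the function names and the toy values are made up for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def correlation_matrix(columns):
    """Nested dict with the correlation of every pair of named columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Toy values, invented for illustration (not taken from the data set).
wines = {
    "alcohol": [9.4, 9.8, 10.5, 11.2, 12.0],
    "density": [0.9978, 0.9968, 0.9960, 0.9950, 0.9940],
    "quality": [5, 5, 6, 6, 7],
}
matrix = correlation_matrix(wines)
for name, row in matrix.items():
    print(name, " ".join(f"{value:5.2f}" for value in row.values()))
```

The matrix is symmetric with ones on the diagonal, which is why reading the quality row or the quality column gives the same numbers.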

Let’s first focus on the quality row. The coefficient for fixed acidity is 0.12, suggesting that fixed acidity (in this case, tartaric acid) in general slightly improves the flavor of red wine. This is consistent with our knowledge that tartaric acid plays an important role in maintaining the chemical stability, color and taste of wine. Citric acid can add a fresh flavor, so its coefficient is also positive (0.23). The unpleasant smell of acetic acid (labeled as volatile acidity here), on the other hand, lowers the quality (-0.39). Sulfur dioxide [1,2] is known as a good antimicrobial and antioxidant, which is essential in maintaining the quality of the wine. Accordingly, the amount of added sulphates goes along with better quality (coefficient = 0.25). This is only a rough way to guess the role of each component, so it is likely that only strong effects are revealed.

How sugar and chlorides (here meaning sodium chloride, i.e., salt) affect the flavor is less clear to me. For sure the wine would be disgusting if it were extremely salty or sweet, but it is complicated to know how slight variations in sugar and salt change the taste. The pH and density are more like summaries of all the chemicals; what matters should be the composition of the solution (wine). To my surprise, the amount of alcohol shows a pretty high correlation with quality (0.48)!

By looking at how these chemicals correlate with each other, we can also get a brief overview of their chemical interactions. As the basic chemical interactions should be quite similar for red and white wines, let’s also look at the table for white wine before drawing conclusions.

Correlation coefficients, white wine
variable fixed.acidity volatile.acidity citric.acid residual.sugar chlorides free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol quality
fixed.acidity 1.00 -0.02 0.29 0.09 0.02 -0.05 0.09 0.27 -0.43 -0.02 -0.12 -0.11
volatile.acidity -0.02 1.00 -0.15 0.06 0.07 -0.10 0.09 0.03 -0.03 -0.04 0.07 -0.19
citric.acid 0.29 -0.15 1.00 0.09 0.11 0.09 0.12 0.15 -0.16 0.06 -0.08 -0.01
residual.sugar 0.09 0.06 0.09 1.00 0.09 0.30 0.40 0.84 -0.19 -0.03 -0.45 -0.10
chlorides 0.02 0.07 0.11 0.09 1.00 0.10 0.20 0.26 -0.09 0.02 -0.36 -0.21
free.sulfur.dioxide -0.05 -0.10 0.09 0.30 0.10 1.00 0.62 0.29 0.00 0.06 -0.25 0.01
total.sulfur.dioxide 0.09 0.09 0.12 0.40 0.20 0.62 1.00 0.53 0.00 0.13 -0.45 -0.17
density 0.27 0.03 0.15 0.84 0.26 0.29 0.53 1.00 -0.09 0.07 -0.78 -0.31
pH -0.43 -0.03 -0.16 -0.19 -0.09 0.00 0.00 -0.09 1.00 0.16 0.12 0.10
sulphates -0.02 -0.04 0.06 -0.03 0.02 0.06 0.13 0.07 0.16 1.00 -0.02 0.05
alcohol -0.12 0.07 -0.08 -0.45 -0.36 -0.25 -0.45 -0.78 0.12 -0.02 1.00 0.44
quality -0.11 -0.19 -0.01 -0.10 -0.21 0.01 -0.17 -0.31 0.10 0.05 0.44 1.00

We can see strong correlations between fixed.acidity, citric.acid and pH, especially in red wines. This is reasonable, as pH mainly measures the amount of hydrogen ions in the solution and acids are the main source of hydrogen ions. Another strong correlation is between total.sulfur.dioxide and free.sulfur.dioxide. As both are different forms of sulfur dioxide in the solution [2], it is not surprising that their quantities are highly related.

Another effect, that mixing alcohol with water decreases the density, is clearly demonstrated by the strong anti-correlation between alcohol and density (-0.50 for red and -0.78 for white wines). Most of the other components are positively correlated with density, as dissolved substances make the solution denser.

Overview of the data:

Quality of red and white wines

The quality distribution seems to be similar in red and white wines. Most wines have quality levels 5-7. Around 4% of the wines have quality level 3-4. About 1% of red wines and 3.6% of white wines have quality level 8. The best level is 8 for red wine and 9 for white wine (only 0.1%). Considering that it is probably difficult to distinguish quality levels that differ by only ±1, I decided to simplify the quality levels to “good”, “average” and “bad” by creating another field, “taste”. The conversion is straightforward: as the reported quality levels are in the range 3-9, we can simply assign quality levels 5-7 to “average”, the lowest two levels (3-4) to “bad” and the highest two levels (8-9) to “good”.
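
The mapping from quality level to taste can be written down in a few lines. This is a Python sketch rather than the code behind this report; the function name `taste` is illustrative, and the counts per quality level are the ones listed in the categorical summary of this report.

```python
from collections import Counter


def taste(quality):
    """Map a quality level (3-9 in this data set) to a taste category."""
    if quality <= 4:
        return "bad"
    if quality >= 8:
        return "good"
    return "average"


# Number of wines (red and white together) per quality level,
# copied from the quality_count table in this report.
quality_count = {3: 30, 4: 216, 5: 2138, 6: 2836, 7: 1079, 8: 193, 9: 5}

taste_count = Counter()
for level, n_wines in quality_count.items():
    taste_count[taste(level)] += n_wines

print(dict(taste_count))
```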

The distribution of taste

You can see that average wines make up the majority. Out of 4898 white wines, 180 are good (\(3.67~\%\)) and 183 are bad (\(3.73~\%\)).
Out of 1599 red wines, only 18 are good (\(1.13~\%\)) and 63 are bad (\(3.94~\%\)). It seems to be more difficult to make a red wine of good quality. Moreover, the small number of good red wines probably won’t provide enough information for the exploration of red wines.

Overview of the distribution of different chemicals in red and white wines:

Before we dive into more details about each component, I would like to first get a brief view of how these components are distributed in red and white wines. I think box plots are the best way of showing all of this in one figure. The median, first quartile and third quartile, given by the box itself, already give us some idea about the distribution of the data. We can also see how many extreme values there are and how extreme they are. The side-by-side comparison between red and white wines highlights their differences. For better visualization, all these extreme points are excluded from the later plots that explore the components in more depth, so this is a good place to show the full figure.

Another way to show the distribution (besides box plots and histograms) is a summary table. It shows similar information as the box plot, but with the actual numbers.

Summary of each component
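
To make the link between a box plot and a summary table concrete, here is a small Python sketch that computes both from the same values. It assumes the common box plot convention that points farther than 1.5 times the interquartile range from the box count as extreme values; the report does not state which rule was used. The function name and the toy values are made up.

```python
import statistics


def box_summary(values, whisker=1.5):
    """Six-number summary plus the points a box plot would draw as outliers."""
    data = sorted(values)
    # "inclusive" interpolates between data points and treats the sample
    # minimum and maximum as the 0th and 100th percentiles.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - whisker * iqr, q3 + whisker * iqr
    return {
        "min": data[0],
        "q1": q1,
        "median": median,
        "mean": statistics.fmean(data),
        "q3": q3,
        "max": data[-1],
        "outliers": [v for v in data if v < low or v > high],
    }


# Toy values, invented for illustration: one value is far from the rest.
toy_sugar = [1, 2, 3, 4, 5, 6, 7, 8, 30]
summary = box_summary(toy_sugar)
print(summary)
```

A single extreme value pulls the mean well above the median, which is the same pattern the residual.sugar row of the summary table shows.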

numerical data

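Summary of chemical components (red and white wines combined; the concentrations shown appear to be in g/\(\mathrm{dm}^3\))
component Min 1stQu Median Mean 3rdQu Max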
fixed.acidity 3.800 6.400 7.000 7.215 7.700 15.900
volatile.acidity 0.0800 0.2300 0.2900 0.3397 0.4000 1.5800
citric.acid 0.0000 0.2500 0.3100 0.3186 0.3900 1.6600
residual.sugar 0.600 1.800 3.000 5.443 8.100 65.800
chlorides 0.00900 0.03800 0.04700 0.05603 0.06500 0.61100
free.sulfur.dioxide 0.00100 0.01700 0.02900 0.03053 0.04100 0.28900
total.sulfur.dioxide 0.0060 0.0770 0.1180 0.1157 0.1560 0.4400
density 0.9871 0.9923 0.9949 0.9947 0.9970 1.0390
pH 2.720 3.110 3.210 3.219 3.320 4.010
sulphates 0.2200 0.4300 0.5100 0.5313 0.6000 2.0000
alcohol 8.00 9.50 10.30 10.49 11.30 14.90

From the box plots and the summary table, you can see that red wines show more fixed and volatile acidity than white wines, but most white wines show higher citric acid. In general, the amount of residual sugar is also higher in white wines, which agrees with our experience. The free as well as the total sulfur dioxide values are much higher in white wines, whereas white wines show lower amounts of added sulphates.

It is worth pointing out that all the chemicals analyzed here differ significantly between red and white wines (two-sample t-test, all p-values < 0.01), which underlines their distinct compositions.
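
As an illustration of such a comparison, the sketch below computes Welch's two-sample t statistic in plain Python. It is an assumption that the Welch variant is the relevant one (it is the default of R's `t.test`); the report only says "2 sample t-test". The p-value uses a normal approximation, which is only reasonable for large samples such as the 1599 red and 4898 white wines, not for the tiny toy samples shown here. All values are invented.

```python
import math
from statistics import NormalDist, fmean, variance


def welch_t(a, b):
    """Welch's t statistic and its approximate degrees of freedom."""
    na, nb = len(a), len(b)
    va, vb = variance(a) / na, variance(b) / nb
    t = (fmean(a) - fmean(b)) / math.sqrt(va + vb)
    dof = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, dof


def approx_two_sided_p(t):
    """Two-sided p-value from a normal approximation (large samples only)."""
    return 2 * (1 - NormalDist().cdf(abs(t)))


# Toy fixed acidity values, invented for illustration.
red = [7.4, 7.8, 11.2, 7.9, 7.3, 7.5, 6.7, 8.9]
white = [7.0, 6.3, 8.1, 7.2, 6.2, 8.1, 6.3, 6.6]

t, dof = welch_t(red, white)
p = approx_two_sided_p(t)
print(f"t = {t:.2f}, dof = {dof:.1f}, approximate p = {p:.3f}")
```

With eleven components tested at once, some correction for multiple comparisons would be a sensible refinement.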

categorical data

quality count
3 30
4 216
5 2138
6 2836
7 1079
8 193
9 5

taste count
bad 246
average 6053
good 198

type count
red 1599
white 4898

You can see that we have about three times as much data on white wines as on red wines. Most wines are in the middle quality levels (5 and 6) and of average taste. Wines of great or poor quality are comparatively rare.

Individual group exploration

Based on the correlations and some common sense, we can probably group these chemicals into 4 types:

  1. acidity related: fixed.acidity, volatile.acidity, citric.acid
  2. sulfur related: total.sulfur.dioxide, free.sulfur.dioxide, sulphates
  3. composition related: density, pH
  4. others: residual.sugar, chlorides, alcohol

I will explore the data group by group.

Acidity

From the box plot we know that both fixed.acidity and volatile.acidity are higher in red wines, whereas white wines seem to have higher citric.acid, the flavor that makes a wine feel fresh. Let’s have a closer look at how the distributions of these acids vary in wines of different qualities.

Even though the sample size of white wines is about three times that of red wines, the distributions of fixed.acidity (i.e., tartaric acid) and volatile.acidity (i.e., acetic acid) are clearly narrower for white wines. Compared to red wines, white wines show a similar but more centered distribution of citric acid. Most red wines actually have less citric acid than white wines. This goes some way to explaining the totally different taste of red and white wines: red wines seem to have a mixed taste of sour, bitter and astringent, whereas white wines seem to be more refreshing.

For red wines, you can see clearly that the distribution of volatile acidity gets broader and its mean gets larger as the quality decreases, whereas the ranges of fixed.acidity and citric acid are the same across qualities. This suggests that controlling acetic acid is critical for the quality of red wines, which makes sense considering the unpleasant smell of acetic acid. On the other hand, you can see the same broadening for the citric acid distribution in white wines, which suggests that it is citric acid that controls the quality of white wines. It is worth noting that the values of fixed acidity and volatile acidity become larger for worse white wines. Therefore, even though the ranges of these two acids are not so critical, too much of them would still spoil the quality.

Sugar and Salt

We all know that the balance between sugar and salt is important for tasty food. How about wines? Besides salt, sugar also plays an important role in balancing sourness. How does sugar relate to the acids in wines?

First let’s look at the amount of sugar and salt in wines with different qualities. I tried to color by different quality levels but it didn’t provide more information. Let’s keep it simple by just looking at three taste levels: bad, average and good.

The amount of sugar is almost 100 times higher than the amount of salt, which probably explains why wines never taste salty.
It is interesting to note that the distribution for good wines is actually more similar to that for bad wines than to that for average wines, suggesting that sugar and salt are not critical in determining the quality of wines.

Now let’s have a closer look at how salt and sugar as well as sugar and acids are balanced in wines with different qualities.

Good white wines show a much narrower range of salt (i.e., chlorides) than average and bad white wines, but the range of sugar is less critical. The amount of sugar is much higher in white wines than in red wines, whereas the salt concentration is usually higher in red wines. This is further evidence for the difference in taste between red and white wines.

It is interesting to note that the shape of the scattered data is very different for red and white wines: red wines scatter horizontally, whereas white wines scatter vertically, indicating totally different critical components. It is important to control the amount of sugar in red wines, but the amount of citric acid is unimportant (good red wines show a similar citric acid range as average and bad red wines). It is the other way around for white wines: good wines show a notably narrower range of citric acid than worse wines, but the range of sugar doesn’t matter.

This observation is quite different from what I had in mind, but it seems reasonable after some thought. The stereotypical red wine is never sweet; it is sour in a complicated way. White wines, on the other hand, can be quite sweet and still taste good.

Alcohol

I don’t know if anyone noticed, but I always plotted the y axis of the histograms on a log scale. Now that we look at the distribution of alcohol alone, let me show you why. The first row shows the histograms with a linear y axis and the second row shows the same histograms with a log-scale y axis. The very small number of good wines is very difficult to see on a linear scale. The shapes of the distributions change with the y scale, but this doesn’t matter for the comparison among quality groups.

The distribution of alcohol doesn’t seem to differ much between quality groups.

However, if we plot the amount of alcohol for all wines sorted by their quality, you can clearly see an increasing trend for both red and white wines. I also looked at this kind of plot for all the other variables, but none of them showed a clear trend across quality levels. This confirms the high correlation between alcohol and quality we observed at the very beginning. As alcohol comes from fermentation, more alcohol might imply a more complete fermentation process, which in general leads to a better taste.
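
A quick numerical counterpart to such a plot is the mean alcohol content per quality level. The sketch below shows the idea in Python with a handful of invented rows; the helper name `mean_by_group` is hypothetical.

```python
from collections import defaultdict
from statistics import fmean


def mean_by_group(rows, group_key, value_key):
    """Mean of value_key for each distinct value of group_key, sorted by group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row[value_key])
    return {group: fmean(values) for group, values in sorted(groups.items())}


# Toy rows, invented for illustration (not taken from the data set).
toy_wines = [
    {"quality": 5, "alcohol": 9.4},
    {"quality": 5, "alcohol": 9.8},
    {"quality": 6, "alcohol": 10.5},
    {"quality": 6, "alcohol": 10.1},
    {"quality": 7, "alcohol": 11.4},
    {"quality": 7, "alcohol": 11.8},
]
means = mean_by_group(toy_wines, "quality", "alcohol")
for level, mean_alcohol in means.items():
    print(level, round(mean_alcohol, 2))
```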

Further exploration

Source of pH

The distribution of pH is similar for red and white wines, mostly in the range \(3-3.5\). The low pH comes from the many acids (e.g., tartaric acid, citric acid, acetic acid) and components that can act as acid sources (e.g., sulphates, sulfur dioxide) among the chemicals listed in the data. So far we know how the amounts of different acids change in red and white wines of different qualities. But can we identify which acid is most crucial to the quality of wine?

I tried to do this by investigating the contribution of different acids to the pH value, which quantifies acidity. As pH is defined as \(-\log_{10}[H^+]\), the hydrogen ion concentration is \(10^{-pH}\), and we can expect a roughly linear relation between \(10^{-pH}\) and the amount of acids. So I grouped the wines according to their type (white, red) and taste (good, average, bad), ran a linear regression between the acids and the pH value, and used the \(R^2\) value as an indicator of the quality of the fit. I did this for different acid sources: “fixed.acidity”, “citric.acid”, “sulphates”, “volatile.acidity”, “total.sulfur.dioxide”. I increased the number of variables by gradually adding acids (in the order above) to the regression. That is, fixed.acidity was the first acid and gave a single-variable linear regression; total.sulfur.dioxide was the last acid added and gave a regression with 5 variables. I also added “chlorides”, which is supposed to be a neutral ion with no effect on the pH value.
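
The procedure of adding one acid at a time and recording \(R^2\) can be sketched as follows. This is a plain Python illustration under stated assumptions, not the code behind the plot: it uses \(10^{-pH}\) as the response because pH is a base-10 logarithm, it fits by ordinary least squares with an intercept, and the function names and toy values are made up.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x


def r_squared(predictors, y):
    """R^2 of a least squares fit of y on the predictor columns plus an intercept."""
    rows = [[1.0] + list(obs) for obs in zip(*predictors)]
    k = len(rows[0])
    # Normal equations (X^T X) beta = X^T y; fine for a few well-scaled columns.
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


# Toy values, invented for illustration (not taken from the data set).
toy = {
    "fixed.acidity": [7.4, 7.8, 11.2, 7.9, 6.7, 8.9, 10.0, 6.0],
    "citric.acid": [0.00, 0.04, 0.56, 0.06, 0.08, 0.30, 0.45, 0.10],
    "pH": [3.51, 3.20, 3.16, 3.30, 3.45, 3.25, 3.10, 3.60],
}
hydrogen = [10 ** -p for p in toy["pH"]]  # [H+] from the definition of pH

order = ["fixed.acidity", "citric.acid"]
r2_steps = []
for i in range(1, len(order) + 1):
    columns = [toy[name] for name in order[:i]]
    r2_steps.append(r_squared(columns, hydrogen))
    print(order[:i], round(r2_steps[-1], 3))
```

For real data, a tested routine such as `numpy.linalg.lstsq` or R's `lm` is a better choice than hand-written normal equations. Note also that \(R^2\) can only stay the same or grow when a variable is added, so the size of each step matters more than the fact that the curve rises.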

Different acid contribution revealed by the r-square value of linear regression

A high r-square value (\(R^2\)) suggests that we included enough sources of acid to account for the pH value. You can see that in the case of red wines, tartaric acid accounts for almost \(70~\%\) of the acidity. Volatile acidity and total sulfur dioxide together contribute another \(10~\%\). Bad red wines show a quite different acid composition: tartaric acid only accounts for about \(40~\%\) of the acidity, while citric acid and sulphates together add about \(30~\%\) more. Tartaric acid is also the main acid source of average red wines. The slight increase in the r-square value after including chlorides might be due to the increased number of fitting variables (adding a variable never lowers \(R^2\)) or to some complicated but not so significant chemistry in the wine.

The fits for white wines are much poorer, suggesting that the acids analyzed here are not the determining factors for the acidity of white wines. We can still get a rough idea of the roles of the acids listed here: tartaric acid is still the top contributor, and sulphates also contribute quite a bit in all the wines. It is very interesting to note that the curve for bad white wines has the same shape as the curve for good red wines, which again confirms the very different taste of red and white wines.

As there are still many other acid sources in wine (e.g., malic acid, lactic acid), the r-square value could probably be improved by including a more complete set of acids or by using a more sophisticated model (e.g., one that considers the degree of dissociation of each acid).

Secret of good wines - the golden ratio?

As you probably noticed from the scatter plots above, many bad wines actually scatter over the same range as good wines. This means that controlling the range of these chemical components doesn’t guarantee good quality. Probably there are some golden ratios for different combinations of chemicals. What can we do to learn something about this?

Based on the range of each chemical in good wines, I removed the data outside these ranges. This leaves 3112 out of 4898 white wines and 185 out of 1599 red wines. With these data we can further investigate what separates good from bad wines, as these wines all have chemical components within the “great range” (the range for good wines). Then I calculated the correlation coefficients of each pair of chemicals for the different taste groups (good, average and bad wines) and types (red and white wines).
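
The filtering step can be sketched like this in Python. It assumes that "range" means the minimum and maximum of each component among the good wines of one type; the function names and the toy rows are invented for illustration.

```python
def good_ranges(wines, components):
    """(min, max) of each component among the wines labeled as good."""
    good = [w for w in wines if w["taste"] == "good"]
    return {c: (min(w[c] for w in good), max(w[c] for w in good)) for c in components}


def within_ranges(wine, ranges):
    """True if every component of the wine lies inside the given ranges."""
    return all(low <= wine[c] <= high for c, (low, high) in ranges.items())


# Toy rows, invented for illustration (not taken from the data set).
toy_wines = [
    {"taste": "good", "alcohol": 12.0, "chlorides": 0.030},
    {"taste": "good", "alcohol": 11.0, "chlorides": 0.045},
    {"taste": "average", "alcohol": 11.5, "chlorides": 0.040},
    {"taste": "average", "alcohol": 9.5, "chlorides": 0.040},
    {"taste": "bad", "alcohol": 11.2, "chlorides": 0.090},
]
ranges = good_ranges(toy_wines, ["alcohol", "chlorides"])
kept = [w for w in toy_wines if within_ranges(w, ranges)]
print(ranges)
print(len(kept), "of", len(toy_wines), "wines are inside the range of good wines")
```

By construction every good wine passes the filter, so the interesting question is how many average and bad wines pass it too.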

Pair correlations of different chemicals
(for wines showing all chemical components in the same range as good wines)

letter a b c d e f g h i j k
component fixed.acidity volatile.acidity citric.acid residual.sugar chlorides free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol

This plot allows us to see how the correlations between chemicals vary for good and bad wines. The length of each black line shows how much a correlation varies across taste groups, as the black lines link the largest and the smallest correlation coefficients.

The differences between taste groups are much smaller in white wines, suggesting that details play a critical role in deciding the quality of white wines. About \(64~\%\) of white wines have all components in the same range as good wines, but only \(3~\%\) of them are actually good wines. On the contrary, the composition of bad red wines is quite different from that of good red wines, so no bad red wine meets all the criteria of good red wines.

Correlations that vary strongly with taste and show the same trend in white and red wines are: the anti-correlation between fixed.acidity and alcohol (a-k), the anti-correlation between volatile.acidity and density (b-h), and the correlation between volatile.acidity and alcohol (b-k). A correlation that varies strongly but with opposite trends is the one between residual.sugar and sulphates (d-j), which is positive for white wines and negative for red wines.

As you can see, not all the relations are straightforward, and a relation that is critical in red wines is most often unimportant for white wines. For instance, good red wines show correlations opposite to those of bad wines in b-h, b-k, c-e, c-h, c-k, d-i, e-f, e-g, e-j, e-k, whereas good white wines show correlations opposite to those of bad wines in a-i, b-h. Bad red wines show much smaller correlations than good red wines in c-f, c-g, f-i, g-i, i-k, whereas bad white wines show much smaller correlations than good red wines in b-e, b-k, d-i, d-j, d-k, h-i. This again emphasizes the complicated taste of wines and the difference between red and white wines.

Final plots and Summary

Here I choose three plots to summarize the findings:

1. Wine composition

First, I summarize the composition of red and white wines of different qualities.

Here I plotted all chemicals in the dataset except alcohol, which is excluded mainly because of its different unit.

White wines have higher amounts of total sulfur dioxide, free sulfur dioxide, and residual sugar than red wines. On the other hand, white wines also show broader distributions of volatile acidity, residual sugar, chlorides and sulphates than red wines.

Concerning the factors controlling the quality of wines, you can see that good red wines tend to have higher amounts of sulphates, residual sugar and citric acid but lower amounts of total sulfur dioxide, free sulfur dioxide and chlorides than worse wines.

As there are more good white wines, the comparison of distribution ranges among wines of different quality clearly shows that the amounts of sugar and sulphates are not critical in controlling the quality of white wines. We also see that good white wines tend to have higher fixed acidity, citric acid, chlorides and total sulfur dioxide.

2. Source of pH:

As acids are the most common components among the chemicals listed here, I use the second final plot to summarize the composition of acids in different wines.

Different acid contribution revealed by the r-square value of linear regression

This plot shows us that:

  1. The acids analyzed here are not the determining factors for the acidity of white wines. We can still get a rough idea of the roles of the acids listed here.
  2. Tartaric acid is the top contributor among the acids listed here. Sulphates also contribute quite a bit in all the wines.
  3. The curve for bad white wines has the same shape as the curve for good red wines, which again confirms the very different taste of red and white wines.

3. Golden ratio:

Finally, we should keep in mind that having chemicals in the right range is not sufficient for a good wine; the ratio between certain chemicals matters. Therefore, I use the correlations between different chemicals in red and white wines of different tastes as the last final plot.

Pair correlations of different chemicals
(for wines showing all chemical components in the same range as good wines)

letter a b c d e f g h i j k
component fixed.acidity volatile.acidity citric.acid residual.sugar chlorides free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol

This plot again emphasizes the very different standards for red and white wines.


Reflection

As I wanted to compare red and white wines, I used a lot of facet plots. They make the plotting code very tidy and I like this a lot. I also used melt a lot to reshape the data for facet plots. But in the last two analyses, “source of pH” and “secret of good wines”, I actually wrote several functions to create data structured in a way that works for facet plots. It is probably also not a bad idea to write functions for the plots instead, and be more flexible with the data.

Statistics is very simple in R, but it took me a lot of time to figure out how to retrieve data from the results of statistical tests. For example, computing all pairwise correlation coefficients for the whole data set is simply “cor(data)”! But I struggled to retrieve these values and ended up turning the result into a data frame, a form I am familiar with. Another example is linear regression: “mtable” summarizes the results of different regressions very well, but I only found out how to get the fitting coefficients and r-square values for one test at a time.

This data set is really nice for revealing the influence of different chemicals on the quality of wines, and I learned a lot more about wines during the analysis. However, it doesn’t actually help us choose a good wine when shopping, mainly because the only thing labeled on the wine bottle is the amount of alcohol (this applies to all alcoholic drinks). I was very surprised by this, considering that even mineral water labels list all the details about the different ions inside. Therefore, to guide people in selecting good wines, an analysis should focus on the information provided on the bottle, such as the winery, the country and the year, together with the quality or people’s preference.